Decompose response_speed into response_speed_with_tool_calls and response_speed_no_tool_calls #57
Open
fanny-riols wants to merge 9 commits into main from
Conversation
…etrics

Splits the existing response_speed diagnostic metric into two filtered variants based on whether the assistant made a tool call in the turn. Parses conversation_trace to map each latency to its turn and checks for tool_call entries on that turn_id. Shared logic (sanity filtering, mean/max, MetricScore construction) is extracted into a _ResponseSpeedBase class; each variant implements only _get_latencies(). Bumps metrics_version to 0.1.2.
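The shared-base pattern this commit describes could look roughly like the sketch below. _ResponseSpeedBase, _get_latencies(), and MetricScore are named in the commit, but the method signatures, the context fields (per_turn_latency, tool_call_turn_ids), and the 60 s sanity cutoff are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class MetricScore:
    # Simplified stand-in for the real MetricScore construction
    name: str
    mean_speed_seconds: float
    max_speed_seconds: float
    num_turns: int


class _ResponseSpeedBase:
    name = "response_speed"
    MAX_SANE_LATENCY_S = 60.0  # assumed sanity cutoff, not taken from the PR

    def _get_latencies(self, context) -> List[float]:
        raise NotImplementedError  # each variant supplies only its own filter

    def compute(self, context) -> Optional[MetricScore]:
        # Shared logic: sanity filtering, mean/max, MetricScore construction
        latencies = [
            lat for lat in self._get_latencies(context)
            if 0 < lat < self.MAX_SANE_LATENCY_S
        ]
        if not latencies:
            return None
        return MetricScore(self.name, mean(latencies), max(latencies), len(latencies))


class ResponseSpeedWithToolCallsMetric(_ResponseSpeedBase):
    name = "response_speed_with_tool_calls"

    def _get_latencies(self, context):
        # Keep only latencies from turns that contain a tool call
        return [lat for turn_id, lat in context.per_turn_latency.items()
                if turn_id in context.tool_call_turn_ids]


class ResponseSpeedNoToolCallsMetric(_ResponseSpeedBase):
    name = "response_speed_no_tool_calls"

    def _get_latencies(self, context):
        return [lat for turn_id, lat in context.per_turn_latency.items()
                if turn_id not in context.tool_call_turn_ids]
```

Because the base class owns the filtering and aggregation, each variant stays a one-method subclass, which is the design choice the commit message points at.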
…DEL_LIST

When restoring redacted secrets in apply_env_overrides, skip deployments that are not present in the current environment's EVA_MODEL_LIST rather than raising a ValueError. Only raise if the missing deployment is the active LLM for this run. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
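The warn-and-skip behaviour could be sketched as below. The names apply_env_overrides and EVA_MODEL_LIST come from the commit; this standalone helper, its signature, and the dict shapes are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def restore_redacted_secrets(redacted, available_deployments, active_llm):
    """Return the subset of redacted overrides that can be restored.

    Hypothetical sketch: deployments missing from the current environment's
    model list are skipped with a warning, unless the missing deployment is
    the active LLM for this run, in which case we still raise.
    """
    restored = {}
    for deployment, secret in redacted.items():
        if deployment not in available_deployments:
            if deployment == active_llm:
                raise ValueError(
                    f"Active LLM deployment {deployment!r} is missing "
                    "from the current environment's model list"
                )
            logger.warning(
                "Skipping redacted secret for unknown deployment %r", deployment
            )
            continue
        restored[deployment] = secret
    return restored
```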
Adds _resolve_path() helper that returns the stored path if it exists on disk, otherwise falls back to output_dir/<filename>. Used in _build_history for pipecat_logs.jsonl and elevenlabs_events.jsonl so that metric reruns work correctly when a run directory has been moved from its original location.
…in analysis app

Adds both new metrics to _NON_NORMALIZED_METRICS so they are rendered as standalone seconds bar charts alongside response_speed. Category grouping, color, and table sorting are handled dynamically via the metric registry.
…onse speed metrics

The filtered variants now read metrics/turn_taking/details/per_turn_latency from the record's metrics.json instead of using context.response_speed_latencies. This gives a direct turn_id → latency mapping, avoiding the index-based alignment that was previously needed to correlate latencies with tool calls. The base response_speed metric is unchanged (still uses UserBotLatencyObserver).
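The turn_id → latency split this commit describes might look like the following. The metrics.json key path (metrics/turn_taking/details/per_turn_latency) is from the commit; where conversation_trace lives and the shape of its entries are assumptions here.

```python
import json
from pathlib import Path


def split_latencies(output_dir: str):
    """Split per-turn latencies by whether the turn contained a tool call.

    Hypothetical sketch: assumes the record's metrics.json also carries a
    conversation_trace whose entries have 'turn_id' and 'type' fields.
    """
    record = json.loads((Path(output_dir) / "metrics.json").read_text())
    per_turn = (
        record.get("metrics", {})
        .get("turn_taking", {})
        .get("details", {})
        .get("per_turn_latency", {})
    )
    trace = record.get("conversation_trace", [])
    tool_turns = {e["turn_id"] for e in trace if e.get("type") == "tool_call"}
    # Direct turn_id -> latency mapping: no index-based alignment needed
    with_tool = {t: lat for t, lat in per_turn.items() if t in tool_turns}
    no_tool = {t: lat for t, lat in per_turn.items() if t not in tool_turns}
    return with_tool, no_tool
```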
…NoToolCallsMetric

Tests cover: missing output_dir, missing metrics.json, missing turn_taking data, no tool-call turns, all tool-call turns, mixed turns (correct split), invalid latency filtering, and an exhaustiveness check that with_tool + no_tool latencies together equal the full per_turn_latency set.
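The exhaustiveness check mentioned above can be sketched as a small property-style test; the data and names here are illustrative, not the PR's actual test code.

```python
def test_split_is_exhaustive():
    # Toy inputs standing in for real per-turn data
    per_turn_latency = {"t1": 1.2, "t2": 3.4, "t3": 0.8}
    tool_call_turns = {"t2"}

    with_tool = {t: lat for t, lat in per_turn_latency.items() if t in tool_call_turns}
    no_tool = {t: lat for t, lat in per_turn_latency.items() if t not in tool_call_turns}

    # Together the two splits must reproduce the full per_turn_latency set,
    # and no turn may land in both.
    assert {**with_tool, **no_tool} == per_turn_latency
    assert not set(with_tool) & set(no_tool)
```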
gabegma reviewed Apr 14, 2026
Summary
Adds a with/without tool call breakdown to the response_speed metric so latency can be compared between turns that required a tool call and turns that didn't. Instead of separate registered metrics, the breakdown is computed as sub-fields within response_speed:

- turn_taking's per_turn_latency (keyed by turn_id) replaces the previous use of Pipecat's UserBotLatencyObserver data.
- conversation_trace is checked to classify each turn as with/without tool calls.
- The response_speed details dict gains two optional sub-dicts, with_tool_calls and no_tool_calls, each with mean_speed_seconds, max_speed_seconds, num_turns, and per_turn_speeds.

Example results across 150 records compared response_speed (all turns) against the with_tool_calls and no_tool_calls breakdowns; tool-call turns were ~3.2 s slower on average in this example.
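For concreteness, the new details sub-dicts would have roughly this shape. The field names are from the PR description; the values below are made up purely to illustrate the structure.

```python
# Illustrative shape only: field names from the PR, values invented.
example_details = {
    "with_tool_calls": {
        "mean_speed_seconds": 4.1,
        "max_speed_seconds": 7.9,
        "num_turns": 6,
        "per_turn_speeds": {"turn_3": 3.8, "turn_7": 4.4},
    },
    "no_tool_calls": {
        "mean_speed_seconds": 0.9,
        "max_speed_seconds": 1.6,
        "num_turns": 14,
        "per_turn_speeds": {"turn_1": 0.8, "turn_2": 1.0},
    },
}
```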
Metrics summary
metrics_summary.json → per_metric.response_speed now includes nested with_tool_calls and no_tool_calls aggregate entries (mean/min/max/count).

Analysis app
response_speed_with_tool_calls and response_speed_no_tool_calls appear as columns in the Diagnostic table, next to response_speed, in both the single-run and cross-run views.

Also included
- apply_env_overrides: deployments with redacted secrets that aren't in the current EVA_MODEL_LIST now warn-and-skip instead of raising, as long as they aren't the active LLM. This allows metrics-only reruns in environments that don't have every deployment from the original run configured.
- _build_history: added _resolve_path() so pipecat_logs.jsonl / elevenlabs_events.jsonl fall back to output_dir/<filename> when the stored path no longer exists, fixing metric reruns after a run directory is moved.